IAGLR 2021 H2O Workshop

1 Agenda

  • Welcomes and introductions – Timothy Maguire (UMich CIGLR) and Jo-fai Chow (H2O.ai) (5 mins)
  • Great Lakes research questions driven by data – Timothy Maguire (15 mins)
  • Introduction to the H2O.ai software – Jo-fai Chow (5 mins)
  • Live-code example of data manipulation and machine learning analysis – Jo-fai Chow (20 mins)
  • Results in context – Timothy Maguire (10 mins)
  • Q&A (5 mins)

2 Welcome


3 About Great Lakes Research


4 Introduction to H2O


5 Software and Code

5.1 Code

5.2 R Packages

  • Check out setup.R
  • For this tutorial:
    • h2o for automatic and explainable machine learning
  • For RMarkdown:
    • knitr for rendering this RMarkdown
    • rmdformats for the readthedown RMarkdown template
    • DT for nice tables
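A minimal sketch for making sure these packages are available before knitting (assumed CRAN installs; setup.R may already handle this for you):

```r
# Install any missing packages from CRAN, then load them as usual
pkgs <- c("h2o", "knitr", "rmdformats", "DT")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```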

6 H2O Basics

# Let's go
library(h2o)   # for H2O Machine Learning
library(knitr) # for kable() tables used throughout

6.1 Start a local H2O Cluster (JVM)

h2o.init()
 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 hours 54 minutes 
    H2O cluster timezone:       Europe/London 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.32.1.2 
    H2O cluster version age:    17 days  
    H2O cluster name:           H2O_started_from_R_joe_cqu561 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   7.80 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 4.0.5 (2021-03-31) 
# Optional settings
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll()   # Optional: remove anything from previous session 
# Enter your lucky seed here ...
n_seed <- 12345

7 Data - Dom Lake Huron Abund

# Import CSV from GitHub
lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')
# Show first few samples
kable(head(lake_data, 5))
C1 Year Region Station Replicate lat lon depth substrate Season Na2O Magnesium MagnesiumOxide AluminumOxide SiliconDioxide Quartz PhosphorusPentoxide Sulphur DDT_TOTAL P.P.TDE P.P.DDE P.P.DDE.1 HeptachlorEpoxide Potassium PotassiumOxide Calcium CalciumOxide TitaniumDioxide Chromium Manganese ManganeseOxide XTR.Iron Iron IronOxide Cobalt Nickel Copper Zinc Selenium Strontium Beryllium Silver Cadmium Tin TotalCarbon OrganicCarbon Carbon.Nitrogen_Ratio TotalNitrogen Mercury Lead Uranium Vanadium Arsenic Chloride Fluoride Sand Silt Clay Mean_grainsize simpsonD shannon Chironomidae Oligochaeta Dreisseniidae Sphaeriidae Diporeia DPOL DBUG
1 2006 Saginaw Bay 10 1 43.94167 -83.62383 11 silt 3 1.2 6221.4 1.7 7.5 76.4 70.4 0.1 0 1.1 0.1 0.8 5.2 0.1 2527.1 2.4 1892.5 1.3 0.4 69.0 2034.5 0.2 3.3 24354.6 2.8 11.9 49.9 28.7 67.3 10.6 59.0 0.5 0.2 1.3 5.5 0.4 1.0 8.0 0.1 154.4 40.9 0.4 38.5 1.1 23.3 402.2 52.3 14.0 33.4 4.5 0.8645201 2.238382 0 3684.24 0.00 0.00 0 0.00 0.00
2 2006 Saginaw Bay 10 2 43.94167 -83.62383 11 silt 3 1.2 6221.4 1.7 7.5 76.4 70.4 0.1 0 1.1 0.1 0.8 5.2 0.1 2527.1 2.4 1892.5 1.3 0.4 69.0 2034.5 0.2 3.3 24354.6 2.8 11.9 49.9 28.7 67.3 10.6 59.0 0.5 0.2 1.3 5.5 0.4 1.0 8.0 0.1 154.4 40.9 0.4 38.5 1.1 23.3 402.2 52.3 14.0 33.4 4.5 0.8128663 1.961416 0 2270.52 0.00 0.00 0 0.00 0.00
3 2006 Saginaw Bay 10 3 43.94167 -83.62383 11 silt 3 1.2 6221.4 1.7 7.5 76.4 70.4 0.1 0 1.1 0.1 0.8 5.2 0.1 2527.1 2.4 1892.5 1.3 0.4 69.0 2034.5 0.2 3.3 24354.6 2.8 11.9 49.9 28.7 67.3 10.6 59.0 0.5 0.2 1.3 5.5 0.4 1.0 8.0 0.1 154.4 40.9 0.4 38.5 1.1 23.3 402.2 52.3 14.0 33.4 4.5 0.8504612 2.040843 0 4712.40 0.00 64.26 0 0.00 0.00
4 2006 Saginaw Bay 11 1 44.02050 -83.57367 9 silty sand 3 1.1 6160.2 1.5 6.5 79.5 74.5 0.1 0 1.1 0.1 0.7 5.3 0.1 2495.2 2.2 1830.8 1.2 0.3 55.3 2076.3 0.2 2.8 24068.9 2.3 9.8 41.0 23.7 54.8 10.5 49.4 0.4 0.2 1.1 5.4 0.4 0.8 7.1 0.1 150.2 35.6 0.4 31.3 0.9 23.2 402.3 61.0 11.3 27.4 3.9 0.7438019 1.754836 0 2034.90 0.00 0.00 0 0.00 0.00
5 2006 Saginaw Bay 11 2 44.02050 -83.57367 9 silty sand 3 1.1 6160.2 1.5 6.5 79.5 74.5 0.1 0 1.1 0.1 0.7 5.3 0.1 2495.2 2.2 1830.8 1.2 0.3 55.3 2076.3 0.2 2.8 24068.9 2.3 9.8 41.0 23.7 54.8 10.5 49.4 0.4 0.2 1.1 5.4 0.4 0.8 7.1 0.1 150.2 35.6 0.4 31.3 0.9 23.2 402.3 61.0 11.3 27.4 3.9 0.6572315 1.341344 0 3941.28 342.72 0.00 0 85.68 257.04
h2o.describe(lake_data)
      Label Type Missing Zeros PosInf NegInf       Min        Max       Mean
1        C1  int       0     0      0      0    1.0000  885.00000  443.00000
2      Year  int       0     0      0      0 2006.0000 2012.00000 2009.00113
3    Region enum       0    96      0      0    0.0000    3.00000         NA
4   Station enum       0    24      0      0    0.0000  122.00000         NA
5 Replicate  int       0     1      0      0    0.0000    3.00000    1.99887
6       lat real       0     0      0      0   43.2695   46.23333   44.68487
        Sigma Cardinality
1 255.6217909          NA
2   2.3727809          NA
3          NA           4
4          NA         123
5   0.8190319          NA
6   0.8554837          NA
 [ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)

h2o.hist(lake_data$DBUG, breaks = 100)

7.1 Define Target and Features

# Define targets
target_DPOL <- "DPOL"
target_DBUG <- "DBUG"

# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))

print(features)
 [1] "Year"                  "Region"                "Station"              
 [4] "Replicate"             "lat"                   "lon"                  
 [7] "depth"                 "substrate"             "Season"               
[10] "Na2O"                  "Magnesium"             "MagnesiumOxide"       
[13] "AluminumOxide"         "SiliconDioxide"        "Quartz"               
[16] "PhosphorusPentoxide"   "Sulphur"               "DDT_TOTAL"            
[19] "P.P.TDE"               "P.P.DDE"               "P.P.DDE.1"            
[22] "HeptachlorEpoxide"     "Potassium"             "PotassiumOxide"       
[25] "Calcium"               "CalciumOxide"          "TitaniumDioxide"      
[28] "Chromium"              "Manganese"             "ManganeseOxide"       
[31] "XTR.Iron"              "Iron"                  "IronOxide"            
[34] "Cobalt"                "Nickel"                "Copper"               
[37] "Zinc"                  "Selenium"              "Strontium"            
[40] "Beryllium"             "Silver"                "Cadmium"              
[43] "Tin"                   "TotalCarbon"           "OrganicCarbon"        
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen"         "Mercury"              
[49] "Lead"                  "Uranium"               "Vanadium"             
[52] "Arsenic"               "Chloride"              "Fluoride"             
[55] "Sand"                  "Silt"                  "Clay"                 
[58] "Mean_grainsize"        "simpsonD"              "shannon"              
[61] "Chironomidae"          "Oligochaeta"           "Sphaeriidae"          
[64] "Diporeia"             
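The setdiff() call above is plain base-R set difference on the column names; a toy illustration with a made-up vector of columns:

```r
# Drop target and ID columns from a vector of column names
cols <- c("Year", "depth", "DPOL", "DBUG", "C1", "Dreisseniidae")
setdiff(cols, c("DPOL", "DBUG", "C1", "Dreisseniidae"))
# leaves "Year" and "depth"
```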

7.2 Split Data into Train/Test

h_split <- h2o.splitFrame(lake_data, ratios = 0.75, seed = n_seed)
h_train <- h_split[[1]] # ~75% for modelling
h_test <- h_split[[2]]  # ~25% for evaluation
dim(h_train)
[1] 656  68
dim(h_test)
[1] 229  68

7.3 Cross-Validation

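The 5-fold cross-validation that nfolds = 5 requests below can be sketched in base R (a toy illustration of the idea, not H2O's internal implementation):

```r
# Assign each of n rows to one of 5 folds at random;
# each fold takes a turn as the holdout while the other four train the model
set.seed(12345)
n <- 20
fold <- sample(rep(1:5, length.out = n))
table(fold)  # roughly equal fold sizes
```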

8 Worked Example - Target “DPOL”

8.1 Baseline Regression Models

  • h2o.glm(): H2O Generalized Linear Model
  • h2o.randomForest(): H2O Random Forest Model
  • h2o.gbm(): H2O Gradient Boosting Model
  • h2o.deeplearning(): H2O Deep Neural Network Model
  • h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC

Let’s start with a GBM:

# Build a default (baseline) GBM
model_gbm_DPOL <- h2o.gbm(x = features,                   # All features
                          y = target_DPOL,            # Target
                          training_frame = h_train,   # H2O dataframe with training data
                          nfolds = 5,                 # Using 5-fold CV
                          seed = n_seed)              # Your lucky seed
# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  4620.394
RMSE:  67.97348
MAE:  16.30104
RMSLE:  NaN
Mean Residual Deviance :  4620.394
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)
H2ORegressionMetrics: gbm

MSE:  2613.399
RMSE:  51.12142
MAE:  12.70523
RMSLE:  NaN
Mean Residual Deviance :  2613.399

Let’s use RMSE (root mean squared error) as our comparison metric.
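For reference, RMSE is just the square root of the mean squared prediction error; in base R with toy vectors:

```r
# RMSE by hand: sqrt of the mean squared error between actuals and predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```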

Build Other Baseline Models (GLM, DRF, DNN, XGBoost) - TRY IT YOURSELF!

# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)

8.2 Manual Tuning

8.2.1 Check out the hyper-parameters for each algo

?h2o.glm 
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost

8.2.2 Train a GBM with manual settings

model_gbm_DPOL_m <- h2o.gbm(x = features, 
                            y = target_DPOL, 
                            training_frame = h_train,
                            nfolds = 5,
                            seed = n_seed,
                                
                            # Manual Settings based on experience
                            learn_rate = 0.1,       # learning rate (H2O's default; lower = more conservative)
                            ntrees = 120,           # more trees than the default of 50
                            sample_rate = 0.7,      # use random n% of samples for each tree  
                            col_sample_rate = 0.7)  # use random n% of features for each tree
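To double-check which settings a trained model actually used, the fitted model object can be inspected (this assumes the model object above; allparameters lists every parameter, including defaults):

```r
# Settings the trained GBM actually used
model_gbm_DPOL_m@allparameters$ntrees
model_gbm_DPOL_m@allparameters$learn_rate
```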

8.2.3 Comparison (RMSE: Lower = Better)

# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
                               "GBM: Gradient Boosting Model (Manual Settings)"),
                     stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA

# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))

# Show Comparison (RMSE: Lower = Better)
kable(d_eval)
model RMSE_cv RMSE_test
GBM: Gradient Boosting Model (Baseline) 67.97348 51.12142
GBM: Gradient Boosting Model (Manual Settings) 67.87922 51.06768

8.3 H2O AutoML

# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
                          y = target_DPOL,
                          training_frame = h_train,
                          nfolds = 5,                       # 5-fold Cross-Validation
                          max_models = 30,                  # Max number of models
                          stopping_metric = "RMSE",         # Metric for early stopping
                          exclude_algos = c("DeepLearning", "XGBoost"), # exclude some algos for a quick demo
                          seed = n_seed)

8.3.1 Leaderboard

# Show the table
kable(as.data.frame(automl_DPOL@leaderboard))
model_id mean_residual_deviance rmse mse mae rmsle
GBM_grid__1_AutoML_20210517_094305_model_7 4022.796 63.42551 4022.796 15.04154 NA
GBM_grid__1_AutoML_20210517_094305_model_3 4057.498 63.69849 4057.498 15.14776 NA
GBM_grid__1_AutoML_20210517_094305_model_2 4083.103 63.89917 4083.103 15.55026 NA
GBM_3_AutoML_20210517_094305 4091.409 63.96412 4091.409 15.42055 NA
GBM_4_AutoML_20210517_094305 4112.423 64.12818 4112.423 15.27154 NA
GBM_grid__1_AutoML_20210517_094305_model_5 4171.660 64.58839 4171.660 17.13950 NA
GBM_2_AutoML_20210517_094305 4268.900 65.33682 4268.900 15.65400 NA
StackedEnsemble_AllModels_AutoML_20210517_094305 4329.084 65.79578 4329.084 15.60418 NA
StackedEnsemble_BestOfFamily_AutoML_20210517_094305 4343.234 65.90322 4343.234 15.36855 NA
GBM_grid__1_AutoML_20210517_094305_model_6 4359.818 66.02892 4359.818 16.03464 NA
GBM_grid__1_AutoML_20210517_094305_model_1 4550.951 67.46073 4550.951 16.47427 NA
XRT_1_AutoML_20210517_094305 4579.862 67.67468 4579.862 16.69115 1.591159
DRF_1_AutoML_20210517_094305 4878.819 69.84854 4878.819 14.79644 1.146071
GBM_1_AutoML_20210517_094305 5043.679 71.01887 5043.679 14.24917 NA
GBM_5_AutoML_20210517_094305 5173.990 71.93045 5173.990 19.08660 NA
GBM_grid__1_AutoML_20210517_094305_model_4 5289.884 72.73159 5289.884 15.41215 NA
GLM_1_AutoML_20210517_094305 6978.418 83.53693 6978.418 28.34823 NA

8.3.2 Best Model (Leader)

# Show the best model
automl_DPOL@leader
Model Details:
==============

H2ORegressionModel: gbm
Model ID:  GBM_grid__1_AutoML_20210517_094305_model_7 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              43                       43               10607         4
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         4    4.00000          9         16    13.79070


H2ORegressionMetrics: gbm
** Reported on training data. **

MSE:  742.9795
RMSE:  27.25765
MAE:  7.378285
RMSLE:  NaN
Mean Residual Deviance :  742.9795



H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  4022.796
RMSE:  63.42551
MAE:  15.04154
RMSLE:  NaN
Mean Residual Deviance :  4022.796


Cross-Validation Metrics Summary: 
                            mean         sd cv_1_valid cv_2_valid cv_3_valid
mae                    15.045994  3.9332395   12.12232   11.87795  17.959179
mean_residual_deviance 4025.1875  2845.8782  2456.1912  2354.3364   8599.125
mse                    4025.1875  2845.8782  2456.1912  2354.3364   8599.125
r2                     0.5540781 0.14167649  0.5431397  0.5116641 0.44749096
residual_deviance      4025.1875  2845.8782  2456.1912  2354.3364   8599.125
rmse                     60.5994  21.002983  49.559975  48.521503   92.73147
rmsle                        NaN        0.0        NaN        NaN        NaN
                       cv_4_valid cv_5_valid
mae                     20.492651  12.777869
mean_residual_deviance   4990.901  1725.3843
mse                      4990.901  1725.3843
r2                     0.46939796  0.7986977
residual_deviance        4990.901  1725.3843
rmse                     70.64631  41.537746
rmsle                         NaN        NaN

8.3.3 Comparison (RMSE: Lower = Better)

d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
                         RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
                         RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)

# Show the table
kable(d_eval)
model RMSE_cv RMSE_test
GBM: Gradient Boosting Model (Baseline) 67.97348 51.12142
GBM: Gradient Boosting Model (Manual Settings) 67.87922 51.06768
Best Model from H2O AutoML 63.42551 45.14810

8.4 Making Predictions

# Make predictions
yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)

# Show the table
kable(head(yhat_test, 5))
predict
35.388011
2.586987
577.323965
3.505565
4.594125
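One way to sanity-check these predictions is to pull them back into plain R vectors and recompute the test RMSE by hand (this assumes the live cluster and objects above; the result should match RMSE_test in the comparison table):

```r
# Compare predictions against the held-out actuals
actual    <- as.vector(h_test[, target_DPOL])
predicted <- as.vector(yhat_test$predict)
sqrt(mean((actual - predicted)^2))
```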

8.5 Explainable AI (Results in Context)

8.5.1 Global Explanations

# Show global explanations for best model from AutoML
h2o.explain(automl_DPOL@leader, newdata = h_test)


Residual Analysis
=================

> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.



Variable Importance
===================

> The variable importance plot shows the relative importance of the most important variables in the model.



SHAP Summary
============

> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.



Partial Dependence Plots
========================

> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.



Individual Conditional Expectations
===================================

> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.

8.5.2 Local Explanations

# Show local explanations
h2o.explain_row(automl_DPOL@leader, newdata = h_test, row_index = 1)


SHAP explanation
================

> SHAP explanation shows contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function. H2O implements TreeSHAP which when the features are correlated, can increase contribution of a feature that had no influence on the prediction.



Individual Conditional Expectations
===================================

> Individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response for a given row. ICE plot is similar to partial dependence plot (PDP), PDP shows the average effect of a feature while ICE plot shows the effect for a single instance.

8.5.3 Local Contributions

# Make Predictions
predictions <- h2o.predict(automl_DPOL@leader, newdata = h_test)

# Show the table
kable(head(predictions, 5))
predict
35.388011
2.586987
577.323965
3.505565
4.594125
# Calculate feature contributions for each sample
contributions <- h2o.predict_contributions(automl_DPOL@leader, newdata = h_test)

# Show the table
kable(head(contributions, 5))
Year Region Station Replicate lat lon depth substrate Season Na2O Magnesium MagnesiumOxide AluminumOxide SiliconDioxide Quartz PhosphorusPentoxide Sulphur DDT_TOTAL P.P.TDE P.P.DDE P.P.DDE.1 HeptachlorEpoxide Potassium PotassiumOxide Calcium CalciumOxide TitaniumDioxide Chromium Manganese ManganeseOxide XTR.Iron Iron IronOxide Cobalt Nickel Copper Zinc Selenium Strontium Beryllium Silver Cadmium Tin TotalCarbon OrganicCarbon Carbon.Nitrogen_Ratio TotalNitrogen Mercury Lead Uranium Vanadium Arsenic Chloride Fluoride Sand Silt Clay Mean_grainsize simpsonD shannon Chironomidae Oligochaeta Sphaeriidae Diporeia BiasTerm
5.749369 0.1816891 -0.6550707 0.0513092 -1.121307 -3.587872 0.6589137 -1.3055155 -1.1573941 -0.3894154 -0.9148800 0.5581922 0.0810013 -0.0429149 0 0 0.1436372 -0.1850808 -0.1730304 -0.1873940 0.4948193 -0.0341789 0.5752563 -0.1505208 0.8201120 0.1439855 -0.1001838 0.0267208 4.236263 0 0 2.3089592 -0.1621301 1.0973696 -0.1827467 -0.1396517 -0.0555104 -1.093223 -0.0208815 0.3243970 0.3806432 0.3837683 -0.7344639 -0.1059432 0.0216723 1.9802601 0.0034754 -0.3391146 0.9240335 -0.0252398 -0.0795804 0.1114721 3.3446767 0.8472175 0.1447516 2.7301340 0.6274920 0.7287988 -2.3404412 -1.4094365 2.6234705 3.4465351 0.396596 -0.0002280 15.93437
4.999606 0.4208446 -6.8606038 0.6449652 -2.148659 -3.505474 10.5673618 3.7239215 0.1442628 -0.2539046 -0.1725378 -0.2667851 -0.0750485 0.1492278 0 0 0.0924014 -0.1850808 -0.0531525 -0.0666508 1.1268554 0.0786915 -4.6323361 0.1549840 1.1303115 -0.5802891 0.0175319 -1.7433980 1.978953 0 0 2.2018185 -0.1929169 0.0712235 0.0319320 -4.1429248 -0.0398978 -1.866422 -0.0096466 0.6521323 -0.2015527 -0.8943596 -1.8839849 -0.0093712 0.0096143 1.1426355 -0.0062303 -0.1420653 -3.0697949 -0.1489434 -0.9779817 -0.0369908 0.4098719 0.7658747 -0.0861650 0.8774084 -0.7691286 -0.1723884 -6.5317354 -1.7592351 0.8794661 -5.2189517 3.085558 -0.0002280 15.93437
54.240490 1.3683642 133.0601501 -2.8597975 48.040829 18.082205 49.4240570 22.9496117 -5.3234730 9.7090874 18.1731472 1.4052534 0.0370039 1.4064714 0 0 0.0019574 -0.0538082 -0.1943945 -0.0862775 7.8344207 -0.0116882 19.3110600 -0.0262053 16.5199566 0.0513210 -0.0295170 3.8383923 31.386347 0 0 14.5122032 -0.0840988 -0.1139140 -0.1669139 11.9667253 -0.0496545 7.111960 -0.0140443 -0.8022944 2.4959888 1.3647826 8.2488222 0.0065774 0.0096143 -0.4122683 -0.0044202 -0.1656154 5.7755022 -0.0164728 2.8159142 0.1114721 23.0695019 10.2007303 0.0533646 6.1467013 0.0001291 -0.0084739 17.2453384 12.1234789 17.2187881 -8.0120039 2.508808 -0.0016405 15.93437
5.659294 0.4179673 -4.3749590 -0.3660498 -2.256765 -3.953184 -1.5347352 -0.8403457 0.1382265 -0.3811278 -0.8222194 -0.1025687 -0.0143357 -0.0302356 0 0 0.0250903 -0.0521320 -0.1157115 -0.1829050 0.6061351 0.0398006 -1.0461478 -0.0527855 -0.4770761 0.0835655 0.0001885 -0.1409906 1.601940 0 0 0.8931869 0.1266488 -0.0090013 -0.0215884 -0.1440378 -0.0694553 -1.504640 -0.0489762 0.0892410 -0.2314712 0.1139337 -1.5825820 -0.0037396 0.0163746 -0.6378350 0.0034754 -0.3322367 0.5104063 -0.0212102 -0.0670387 -0.0369908 -0.7185428 0.6596766 -0.2510566 0.3896292 -0.1075576 0.1227657 -0.8817295 -0.5447686 0.5173434 -0.7864236 0.301690 -0.0002280 15.93437
5.831995 0.4179673 -5.0647283 0.3835035 -2.256765 -3.956096 -1.5347352 -0.8403702 0.6334915 -0.3811278 -0.7845055 -0.1025687 -0.0143357 -0.0302356 0 0 0.0250903 -0.0521320 -0.1157115 -0.1829050 0.7071946 0.0398006 -1.0832016 -0.0527855 -0.4770761 0.0728301 0.0001885 -0.1409906 1.552421 0 0 0.8931869 0.1283323 -0.0090013 -0.0215884 -0.1440378 -0.0694553 -1.504640 -0.0489762 0.0892410 -0.2314712 0.1139337 -1.6291332 -0.0037396 0.0163746 -0.6378350 0.0034754 -0.2791277 0.5104063 -0.0212102 -0.0670387 -0.0369908 -0.7154787 0.6079237 -0.2510566 0.3896292 -0.1075576 0.1292654 -0.7736058 -0.3036894 0.5241177 -0.7861704 0.301690 -0.0002280 15.93437
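A useful property of these contributions: for each row, the feature contributions plus the BiasTerm should sum to the model's raw prediction (which, for a regression GBM with an identity link as here, is the prediction itself). A quick check, assuming the objects above:

```r
# Each row of contributions (including BiasTerm) should sum to the prediction
d_contrib <- as.data.frame(contributions)
d_pred    <- as.data.frame(predictions)
all.equal(rowSums(d_contrib), d_pred$predict, check.attributes = FALSE)
```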

9 Quick Recap


9.1 Learning Resources

9.2 Other Use Cases

  • Leveraging explainability in real-world ML applications Link
  • Practical Explainable AI: Loan Approval Use Case Link
  • My Numerai AI Hedge Fund Use Case: Presentation Link

10 Your Turn - Get Your Hands Dirty!

  • Try to build models for target “DBUG”. (Hint: you can change the target and reuse most of the code above.)
  • Instead of using DPOL and DBUG, try using other variables as targets and build predictive models. (Hint: change features and targets)
  • Try your own datasets.